Dataset overlap density analysis

Author

  • Andreas H. Göller
Abstract

The need to compare compound datasets arises in various scenarios, such as mergers, library extension programs, gap analysis, combinatorial library design, or estimation of QSAR model applicability domains. Whereas it is relatively easy to find identical compounds in two datasets, quantifying the overlap is not straightforward. The approaches described so far include pairwise nearest-neighbor comparisons, clustering with mixed-cluster statistics, and binning of, e.g., rule-of-five property space distributions. The BCUT methodology creates a binned N-dimensional space and allows the number of mixed cells to be assessed. ChemGPS creates a PCA reference projection based on drug-like and satellite molecules in property space to classify new compounds. But is it possible, and also plausible, to quantify the overlap of two datasets in a single interpretable number? PCA projection models with the World Drug Index (WDI) as drug-like reference space were created based on MACCS, ECFP4, E-state, or Lipinski-like physicochemical descriptors. Compounds from the commercial vendor i-research library, ZINC, ChEMBL, and a current screening subset from PubChem were projected onto the WDI maps. The dataset overlap density (DOD) index is calculated from the summations over the occupancies of each N-dimensional "volume" element occupied by both datasets, divided by all such elements populated by at least one dataset. The index provides a measure of the overlap of two sets. It is shown that the number of principal components needed to describe at least 75% of the information content of the descriptor varies greatly and that a projection in 2 dimensions is not adequate. Such N-dimensional projections are extremely sparse (about 10^43 elements for WDI and the MACCS descriptor) and crowded only in small regions of the spanned N-dimensional space. The approach is applicable to any descriptor.
It can be extended to a DOD vector based on different descriptor types, each of which describes different characteristics of the encoded molecules. The box element graining can easily be adjusted as needed for a particular application. The approach also allows local gaps or overlaps to be quantified. Proprietary datasets can be compared using just the first N principal component values, without even seeing the underlying descriptors.
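
The workflow summarized in the abstract (reference PCA, projection, binning, overlap ratio) can be illustrated with a minimal Python sketch. This is one possible reading of the abstract, not the author's implementation: the function names (fit_pca, project, occupancy, dod_index), the uniform cell width, and the choice to count compounds of both datasets in the numerator and denominator are all assumptions made for illustration. Occupied cells are stored in a dictionary, so the very sparse N-dimensional grid is never allocated in full.

```python
import numpy as np

def fit_pca(reference, min_variance=0.75):
    """Fit PCA on the reference set (e.g. WDI descriptors) and keep the
    fewest components whose cumulative explained variance reaches
    min_variance. Hypothetical helper, not taken from the paper."""
    mean = reference.mean(axis=0)
    _, s, vt = np.linalg.svd(reference - mean, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    n = int(np.searchsorted(np.cumsum(explained), min_variance)) + 1
    n = min(n, len(s))  # guard against floating-point shortfall
    return mean, vt[:n]

def project(data, mean, components):
    """Project a dataset onto the reference principal components."""
    return (data - mean) @ components.T

def occupancy(scores, origin, cell_width):
    """Count compounds per N-dimensional cell.
    Returns a dict {cell index tuple: count} holding occupied cells only."""
    cells = np.floor((scores - origin) / cell_width).astype(int)
    keys, counts = np.unique(cells, axis=0, return_counts=True)
    return {tuple(k.tolist()): int(c) for k, c in zip(keys, counts)}

def dod_index(occ_a, occ_b):
    """Assumed DOD definition: compounds in cells occupied by both
    datasets, divided by compounds in cells occupied by at least one."""
    shared = occ_a.keys() & occ_b.keys()
    union = occ_a.keys() | occ_b.keys()
    shared_count = sum(occ_a[k] + occ_b[k] for k in shared)
    total_count = sum(occ_a.get(k, 0) + occ_b.get(k, 0) for k in union)
    return shared_count / total_count if total_count else 0.0
```

In use, both datasets would be projected with the same mean and components obtained from the reference set and binned with the same origin and cell_width, so that their cell indices are directly comparable. A coarser or finer cell_width corresponds to the adjustable box element graining mentioned in the abstract.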


Similar resources

Space-use scaling and home range overlap in primates.

Space use is an important aspect of animal ecology, yet our understanding is limited by a lack of synthesis between interspecific and intraspecific studies. We present analyses of a dataset of 286 estimates of home range overlap from 100 primate species, with comparable samples for other space-use traits. To the best of our knowledge, this represents the first multispecies study using overlap d...

Robust Estimation in Linear Regression Model: the Density Power Divergence Approach

The minimum density power divergence method provides a robust estimate in the face of a situation where the dataset includes a number of outlier data. In this study, we introduce and use a robust minimum density power divergence estimator to estimate the parameters of the linear regression model and then with some numerical examples of linear regression model, we show the robustness of this est...

Indirect Relations in Yeast Protein Interactome

Figure 1: Overlap of the datasets. As pointed out frequently, there is very little overlap of observed interactions among yeast proteins when more than one method [4] is combined or when more than one experiment with the same method [1] is compared. As an example, focusing on the two yeast two-hybrid (Y2H) interaction datasets by Ito et al. (“core data”) [1] and Uetz et al. [3], the overlap of ...

A Scheduling Approach with Respect to Overlap of Computing and Data Transferring in Grid Computing

In this paper, we present a two-level distributed schedule model, and propose a scheduling approach with respect to overlap of computing and data transferring. On the basis of network status, node load, and the relation between task execution and task data access, data transferring and computing can occur concurrently in the following three cases: a) A task is being executed on a part of its da...

EHU-ALM: Similarity-Feature Based Approach for Student Response Analysis

We present a 5-way supervised system based on syntactic-semantic similarity features. The model deploys text overlap measures, WordNet-based lexical similarities, graph-based similarities, corpus-based similarities, syntactic structure overlap, and predicate-argument overlap measures. These measures are applied to question, reference answer and student answer triplets. We take into account the ne...

Journal title:

Volume 5, Issue

Pages -

Publication date 2013